Kenneth Tay
Oct 9, 2018
Goal: Demonstrate that you know how to do data analysis in R
Minimum requirements:
vec <- c("a", "b", "c")
vec
## [1] "a" "b" "c"
vec[c(2,4)]
## [1] "b" NA
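As a small sketch of the indexing behaviour above: asking for a position past the end of a vector gives NA rather than an error. The variable names below are only for illustration.

```r
vec <- c("a", "b", "c")

# Positions 2 and 4: position 4 does not exist, so R returns NA for it
out_of_range <- vec[c(2, 4)]
print(out_of_range)

# Negative indices drop elements instead of selecting them
without_first <- vec[-1]
print(without_first)

# length() tells us which positions are valid (1 through 3 here)
n <- length(vec)
```

Checking length(vec) before indexing is one simple way to avoid accidental NAs.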
classes <- list(quarter = "Fall 2018/19",
                ID = c("STATS 32", "STATS 101", "STATS 200"),
                credits = 12)
classes$ID
## [1] "STATS 32"  "STATS 101" "STATS 200"
classes[["credits"]]
## [1] 12
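To make the difference between the list accessors concrete, here is a short sketch. It reuses the classes list from above; the names ids, credits and sub_list are only illustrative.

```r
classes <- list(quarter = "Fall 2018/19",
                ID = c("STATS 32", "STATS 101", "STATS 200"),
                credits = 12)

# $ and [[ ]] both extract the element itself
ids <- classes$ID
credits <- classes[["credits"]]

# Single brackets return a (smaller) list rather than the element
sub_list <- classes["credits"]
class(credits)   # "numeric"
class(sub_list)  # "list"

# names() lists the element names
names(classes)
```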
A data frame is a special type of list:
data(mtcars)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
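Because a data frame is a list of equal-length columns, the list accessors work on it as well. A minimal sketch using the built-in mtcars data (the variable names are illustrative):

```r
data(mtcars)

# A data frame is a list whose elements are equal-length columns
is.list(mtcars)          # TRUE
length(mtcars)           # number of columns: 11

# So list-style access works on columns
mpg_dollar <- mtcars$mpg
mpg_bracket <- mtcars[["mpg"]]
identical(mpg_dollar, mpg_bracket)

# Matrix-style [row, column] indexing also works
first_rows <- mtcars[1:3, c("mpg", "cyl")]
```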
str, summary
head, tail
names, dim, nrow, ncol
table
mean, median, sd, var
factor
I want all the rows such that the value of the cyl column is equal to 2:
vehicles[vehicles$cyl == 2, ]
df
##    A    B
## 1  1    a
## 2  2    b
## 3  3    c
## 4 NA    d
## 5 NA <NA>
df$A == 2
## [1] FALSE  TRUE FALSE    NA    NA
df[df$A == 2, ]
##       A    B
## 2     2    b
## NA   NA <NA>
## NA.1 NA <NA>
Fix 1: test that the value is not NA and is equal to 2
df[!is.na(df$A) & df$A == 2, ]
##   A B
## 2 2 b
Fix 2: use the which function
which(df$A == 2)
## [1] 2
df[which(df$A == 2), ]
##   A B
## 2 2 b
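The two fixes can be compared side by side in one runnable sketch. The data frame is recreated from the printed output above. The subset() call is an extra option that the slides do not mention: it treats an NA in the condition as FALSE.

```r
# Recreate a data frame like the one shown above
df <- data.frame(A = c(1, 2, 3, NA, NA),
                 B = c("a", "b", "c", "d", NA))

# Naive subsetting keeps an all-NA row for every NA in the condition
naive <- df[df$A == 2, ]

# Fix 1: explicitly exclude NAs
fix1 <- df[!is.na(df$A) & df$A == 2, ]

# Fix 2: which() returns only the positions where the condition is TRUE
fix2 <- df[which(df$A == 2), ]

# Another option (not in the slides): subset() treats NA as FALSE
fix3 <- subset(df, A == 2)

nrow(naive)  # 3
nrow(fix1)   # 1
```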
E.g. Take the mean of c(1,3,NA).
mean(c(1,3,NA))
## [1] NA
mean(c(1,3,NA), na.rm = TRUE)
## [1] 2
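The same na.rm argument is available in several other base R summary functions. A brief sketch, with illustrative variable names:

```r
x <- c(1, 3, NA)

# Without na.rm, any NA makes the result NA
mean(x)                        # NA
sum(x)                         # NA

# With na.rm = TRUE, NAs are dropped before computing
m  <- mean(x, na.rm = TRUE)    # 2
s  <- sum(x, na.rm = TRUE)     # 4
md <- median(x, na.rm = TRUE)  # 2

# Counting the missing values is often a useful first check
n_missing <- sum(is.na(x))     # 1
```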
ggplot2 (and the + syntax)
“The simple graph has brought more information to the data analyst’s mind than any other device.” - John Tukey
## mpg weight cylinders
## 1 21.0 2.620 6
## 2 21.0 2.875 6
## 3 22.8 2.320 4
## 4 21.4 3.215 6
## 5 18.7 3.440 8
## 6 18.1 3.460 6
## 7 14.3 3.570 8
## 8 24.4 3.190 4
## 9 22.8 3.150 4
## 10 19.2 3.440 6
## 11 17.8 3.440 6
## 12 16.4 4.070 8
## 13 17.3 3.730 8
## 14 15.2 3.780 8
## 15 10.4 5.250 8
## 16 10.4 5.424 8
## 17 14.7 5.345 8
## 18 32.4 2.200 4
## 19 30.4 1.615 4
## 20 33.9 1.835 4
## 21 21.5 2.465 4
## 22 15.5 3.520 8
## 23 15.2 3.435 8
## 24 13.3 3.840 8
## 25 19.2 3.845 8
## 26 27.3 1.935 4
## 27 26.0 2.140 4
## 28 30.4 1.513 4
## 29 15.8 3.170 8
## 30 19.7 2.770 6
## 31 15.0 3.570 8
## 32 21.4 2.780 4
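The values in this table match three columns of the built-in mtcars data (mpg, wt and cyl). The slides do not show how df was built, so the following is only one plausible reconstruction. In particular, storing cylinders as a factor is an assumption.

```r
data(mtcars)

# Assumed construction of the data frame shown above
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))  # factor is an assumption

head(df, 3)
dim(df)
```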
What is the distribution of cylinders in my dataset?
What is the distribution of miles per gallon in my dataset?
What is the relationship between mpg and weight?
What is the relationship between mpg and time?
Not so good…
Easier to see the trend
For each value of cylinder, what is the distribution of mpg like?
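Each of these questions suggests a different kind of plot. The sketch below shows one possible choice per question. It assumes df is built from mtcars as a data frame with a factor column cylinders, and it assumes the ggplot2 package is installed. The question about time is left out because the table shown has no time column.

```r
library(ggplot2)

# Assumed reconstruction of df from mtcars
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Distribution of a categorical variable: bar chart
p_cyl <- ggplot(df, aes(x = cylinders)) + geom_bar()

# Distribution of a continuous variable: histogram
p_mpg <- ggplot(df, aes(x = mpg)) + geom_histogram(bins = 10)

# Relationship between two continuous variables: scatterplot
p_scatter <- ggplot(df, aes(x = weight, y = mpg)) + geom_point()

# A continuous variable within each category: boxplot
p_box <- ggplot(df, aes(x = cylinders, y = mpg)) + geom_boxplot()

print(p_scatter)
```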
I have father-son pairs. For each pair, I record their height and weight, as well as their ethnicities. I want to study the relationship between characteristics of the father and that of the son. What plots could help me?
ggplot2
ggplot2 package
ggplot2 reference manual
Data: Dataset we are using for the plot
## mpg weight cylinders
## 1 21.0 2.620 6
## 2 21.0 2.875 6
## 3 22.8 2.320 4
## 4 21.4 3.215 6
## 5 18.7 3.440 8
## 6 18.1 3.460 6
## 7 14.3 3.570 8
## 8 24.4 3.190 4
## 9 22.8 3.150 4
## 10 19.2 3.440 6
Geometries: Visual elements used for our data
Geom: point
Aesthetics: Defines the data columns which affect various aspects of the geom
3 different aesthetics:
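The original list of aesthetics did not survive here, so the sketch below assumes the three are x position, y position and colour. It also assumes df is built from mtcars with a factor column cylinders.

```r
library(ggplot2)

# Assumed reconstruction of df from mtcars
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Point geom with three aesthetics mapped to data columns (assumed choice):
#   x position -> weight, y position -> mpg, colour -> cylinders
p <- ggplot(data = df) +
  geom_point(mapping = aes(x = weight, y = mpg, colour = cylinders))

print(p)
```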
We can have more than one layer in a graphic.
Each layer contains (essentially):
ggplot2 code: take 1
Making use of ggplot’s sensible defaults:
ggplot() +
  geom_boxplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
  geom_point(data = df, mapping = aes(x = cylinders, y = mpg))
ggplot2 code: take 2
Using jitter to avoid “overplotting”:
ggplot() +
  geom_boxplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
  geom_point(data = df, mapping = aes(x = cylinders, y = mpg),
             position = "jitter")
ggplot2 code: take 3
When layers share attributes, we only have to type them once:
ggplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
  geom_boxplot() +
  geom_point(position = "jitter")
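A possible refinement of take 3, offered as a sketch rather than as part of the original slides: position_jitter() lets us control how far the points move and fix the random seed so the plot is reproducible. The width, height and seed values are arbitrary choices, and df is assumed to be built from mtcars as below.

```r
library(ggplot2)

# Assumed reconstruction of df from mtcars
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

p <- ggplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
  # Hide the boxplot's own outlier points so they are not drawn twice
  geom_boxplot(outlier.shape = NA) +
  # Jitter horizontally only, so the mpg values stay exact
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1))

print(p)
```

Setting height = 0 keeps the y values untouched, which matters when the vertical axis carries the measurement of interest.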
Optional material
One graphic contains:
Behind the scenes, R may need to do some transformation on the dataset to make the graphic.
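A bar chart is one example of such a transformation: geom_bar() counts the rows in each category before drawing. The sketch below uses layer_data() from ggplot2 to inspect the computed counts and compares them with table(). It assumes df is built from mtcars as shown.

```r
library(ggplot2)

# Assumed reconstruction of df from mtcars
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

p <- ggplot(df, aes(x = cylinders)) + geom_bar()

# Data that ggplot2 computed for the bar layer (one row per bar)
bar_data <- layer_data(p)
bar_data$count

# The same counts computed by hand
counts <- table(df$cylinders)
counts
```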
Sometimes we need to tweak the position of the geometric elements because they obscure each other.
Only 9 data points??
Much better
Default colors
Manually chosen colors
Default axis limits
Manually chosen axis limits
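Colours and axis limits can both be overridden by adding scale layers. The following is a sketch only: the specific colours and limits are arbitrary choices, and df is assumed to be built from mtcars as below.

```r
library(ggplot2)

# Assumed reconstruction of df from mtcars
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Default colours and default axis limits
p_default <- ggplot(df, aes(x = weight, y = mpg, colour = cylinders)) +
  geom_point()

# Manually chosen colours (one per factor level) and axis limits
p_custom <- p_default +
  scale_colour_manual(values = c("4" = "darkgreen",
                                 "6" = "orange",
                                 "8" = "purple")) +
  xlim(0, 6) +
  ylim(0, 40)

print(p_custom)
```

Note that xlim() and ylim() drop points outside the limits; the limits here are wide enough to keep every row.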
Refers to all non-data ink
ggplot2’s default theme
Minimal theme
Classic theme
Dark theme
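Each of these themes corresponds to a built-in ggplot2 function that can be added to a plot. A minimal sketch, assuming df is built from mtcars as below:

```r
library(ggplot2)

# Assumed reconstruction of df from mtcars
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

p <- ggplot(df, aes(x = weight, y = mpg)) + geom_point()

p_default <- p + theme_gray()     # ggplot2's default theme
p_minimal <- p + theme_minimal()  # minimal theme
p_classic <- p + theme_classic()  # classic theme
p_dark    <- p + theme_dark()     # dark theme

print(p_minimal)
```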
rgb(0,0,1), rgb(1,0,0), rgb(0,0,0), rgb(1,1,1)
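For reference, rgb() takes red, green and blue intensities (between 0 and 1 by default) and returns a hex colour string. The sketch below evaluates the four calls above; the orange example and the variable names are additions for illustration.

```r
blue  <- rgb(0, 0, 1)  # "#0000FF"
red   <- rgb(1, 0, 0)  # "#FF0000"
black <- rgb(0, 0, 0)  # "#000000"
white <- rgb(1, 1, 1)  # "#FFFFFF"

# maxColorValue switches to the familiar 0-255 scale
orange <- rgb(255, 165, 0, maxColorValue = 255)  # "#FFA500"

print(c(blue, red, black, white, orange))
```

These strings can be used anywhere R or ggplot2 accepts a colour.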